library(ncdf4)
library(lubridate)
library(ggplot2)
library(fields)
library(rvest)
library(raster)
setwd(here::here())
main_folder = getwd()Investigation of Terrestrial Water Storage and Precipitation Observations
This document is a product of the final project for the STAT 570 lecture, focusing on data handling and visualization tools. It is essential to acknowledge that minor errors may be present, and the methods employed may not necessarily reflect the optimal approach related to the dataset.
Çağatay Çakan cakan.cagatay@metu.edu.tr
1 Introduction
Water is one of the essential resources in the world. It is used for many purposes such as agricultural, industrial, and domestic activities. The excessive use of water could be a problem in the future. For this reason, it should be tracked. Investigation of terrestrial water storage (TWS) is one way of tracking the water. TWS is defined as the total amount of water stored on land. In other words, TWS can be described as the sum of snow, ice, surface water, soil moisture, and groundwater. I want to focus on the cause of terrestrial water storage variability. The change in the storage can be explained by the water balance equation. The formula of the change in the storage is following:
\[ dS/dt = P - ET - R \]
where, \(dS/dt\) is the change in the storage, \(P\) is the precipitation, \(ET\) is the evapotranspiration, and \(R\) is the runoff. Assuming the relationship between \(P\) and \(ET+R\) is constant for a region, thus the variability of the storage come from precipitation.
2 Data
2.1 TWS Observations
TWS observations are obtained from the Gravity Recovery and Climate Experiment (GRACE) and GRACE Follow-On (GRACE-FO). They are two missions and each mission has twin satellites. GRACE operated between April 2002 and July 2017, and GRACE-FO is launched May 2018 and the data is available up to date. The working principle of these satellite is related to gravity and surface mass. They measure temporal variations of Earth’s gravitational potential. After removing the atmospheric and oceanic effects, the remaining signal on monthly to inter-annual timescales is mostly related to variations of TWS. The data set have monthly temporal resolution and 1° spatial resolution. However, the preprocessed version of the data is available which is JPL GRACE and GRACE-FO Mascon Product. This data has spatial resolution of 0.5°.
2.2 Precipitation Observations
The precipitation observations are obtained from the Global Precipitation Climatology Project (GPCP) Monthly Analysis Product. The precipitation data has monthly temporal resolution. However, its spatial resolution is 2.5°.
3 Data Handling
The TWS and precipitation observation data will be handled in this part. ncdf4 package is used to handle netcdf file. lubridate package is used for date. ggplot2 package is used for figures. fields package is used for image.plot function. rvest package is used to handle html files. raster package is used for resample function.
3.1 TWS
3.1.1 Downloading TWS Data
The GRACE and GRACE-FO TWS data as the Mascon Product can be downloaded from here. After clicking the data, you should sign in and the data will be downloaded automatically. First, what the data contains should be checked.
3.1.2 Handling TWS Data
inname110 <- paste0(main_folder, '/01-GRACE/GRCTellus.JPL.200204_202310.GLO.RL06.1M.MSCNv03CRI.nc')
nc_file1 <- nc_open(inname110)The downloaded data are in netcdf format. This type of format is used in hydro-meteorology a lot. The view of this format in R can be seen in the below figure.
The lwe_thickness variable is used as TWS data for this dataset.
# The units of the variable
nc_file1[["var"]][["lwe_thickness"]][["units"]][1] "cm"
# The dimensions of the variable
nc_file1[["var"]][["lwe_thickness"]][["varsize"]][1] 720 360 226
The units of the data is cm. The data has 0.5° spatial resolution, so longitude data contains 360/0.5 = 720 points, latitude data contains 180/0.5 = 360 points. Up to this point, everything is perfect. However, the last column shows 226 points. The observed period is between 2002 and 2023. So there are 22 years, in other words there are 264 months. Thus, there are some missing months, but which ones are missing? Probably, up to the April 2002, there are missing 3 months because the mission was not started at that period, and after the October 2023, there are missing 2 months because the data are not available for that period. What about others? How could you decide which months are missing? Check the time variable!
nc_file1[["dim"]][["time"]]$name
[1] "time"
$len
[1] 226
$unlim
[1] FALSE
$group_index
[1] 1
$group_id
[1] 65536
$id
[1] 2
$dimvarid
$id
[1] 2
$group_index
[1] 1
$group_id
[1] 65536
$list_index
[1] -1
$isdimvar
[1] TRUE
attr(,"class")
[1] "ncid4"
$units
[1] "days since 2002-01-01T00:00:00Z"
$calendar
[1] "gregorian"
$vals
[1] 106.5 129.5 227.5 258.0 288.5 319.0 349.5 380.5 410.0 439.5
[11] 470.0 495.5 561.5 592.5 623.0 653.0 684.0 714.5 736.5 777.0
[21] 805.5 836.0 866.5 897.0 927.5 958.5 989.0 1019.5 1050.0 1080.5
[31] 1111.5 1141.0 1170.5 1201.0 1231.5 1262.0 1292.5 1323.5 1354.0 1384.5
[41] 1415.0 1445.5 1476.5 1506.0 1535.5 1566.0 1596.5 1627.0 1657.5 1688.5
[51] 1719.0 1749.5 1780.0 1810.5 1841.5 1871.0 1900.5 1931.0 1961.5 1992.0
[61] 2022.5 2053.5 2084.0 2114.5 2145.0 2175.5 2206.5 2236.5 2266.5 2297.0
[71] 2327.5 2358.0 2388.5 2419.5 2450.0 2480.5 2511.0 2541.5 2572.5 2602.0
[81] 2631.5 2662.0 2692.5 2723.0 2753.5 2784.5 2815.0 2845.5 2876.0 2906.5
[91] 2937.5 2967.0 2996.5 3027.0 3057.5 3088.0 3118.5 3149.5 3180.0 3210.5
[101] 3241.0 3269.5 3335.5 3361.5 3392.0 3422.5 3485.5 3514.5 3545.0 3575.5
[111] 3591.5 3652.0 3667.5 3697.5 3727.5 3746.5 3819.0 3849.5 3880.5 3908.0
[121] 3974.5 4002.5 4033.5 4062.0 4128.0 4153.5 4184.0 4214.5 4306.5 4337.0
[131] 4367.5 4391.5 4457.5 4488.0 4518.5 4546.0 4610.5 4641.0 4671.5 4703.0
[141] 4769.5 4793.0 4822.5 4853.0 4864.0 4943.5 4975.5 5004.5 5104.5 5128.5
[151] 5157.0 5188.5 5253.0 5280.0 5309.5 5346.5 5444.5 5471.5 5499.0 5568.5
[161] 5592.5 5610.5 5640.0 6010.0 6034.0 6147.5 6163.0 6193.5 6224.5 6253.0
[171] 6283.5 6314.0 6344.5 6375.0 6405.5 6436.5 6467.0 6497.5 6528.0 6558.5
[181] 6589.5 6619.5 6649.5 6680.0 6710.5 6741.0 6771.5 6802.5 6833.0 6863.5
[191] 6894.0 6924.5 6955.5 6985.0 7014.5 7045.0 7075.5 7106.0 7136.5 7167.5
[201] 7198.0 7228.5 7259.0 7289.5 7320.5 7350.0 7379.5 7410.0 7440.5 7471.0
[211] 7501.5 7532.5 7563.0 7593.5 7624.0 7654.5 7685.5 7715.0 7744.5 7775.0
[221] 7805.5 7836.0 7866.5 7897.5 7928.0 7958.5
$create_dimvar
[1] TRUE
attr(,"class")
[1] "ncdim4"
The units of the time is day since 2002-01-01T00:00:00Z, and there are values. These values are represent the day since January 1st, 2002. Thus, the exact location of the data could be found, and relocate the data called GRF_TWS.
start_date <- as.Date('2002-01-01')
nc_tws <- ncvar_get(nc_file1, 'lwe_thickness')
nc_tim <- ncvar_get(nc_file1, 'time')
TWS_lon <- ncvar_get(nc_file1, 'lon')
TWS_lat <- ncvar_get(nc_file1, 'lat')
timlen <- length(nc_tim)
GRF_TWS <- array(NA, dim = c(720,360,12,22))
for (j in 1:timlen) {
nc_date <- start_date + nc_tim[j]
nc_mon0 <- as.numeric(format(nc_date, format = "%m"))
nc_year <- as.numeric(format(nc_date, format = "%Y"))
year_loc <- nc_year - 2002 + 1
GRF_TWS[,,nc_mon0,year_loc] = nc_tws[,,j]
}Check the time series of the global spatial mean of TWS.
GRF_TWS_3D <- GRF_TWS
dim(GRF_TWS_3D) <- c(dim(GRF_TWS)[1], dim(GRF_TWS)[2], dim(GRF_TWS)[4]*12)
TWS_TS <- colMeans(GRF_TWS_3D, na.rm = T, dims = 2)
start_date <- ymd("2002-01-01")
end_date <- ymd("2023-12-31")
TWS_date <- seq(start_date, end_date, by = "months")
TWS_df <- data.frame(TWS_date, TWS_TS)
ggplot(TWS_df, aes(x = TWS_date, y = TWS_TS)) +
geom_line() +
scale_x_date(limit = c(as.Date("2002-01-01"), as.Date("2025-06-01"))) +
labs(x = "", y = "Spatial Mean of TWS (cm)")The missing data are shown in this figure. The reason for this is that there is a battery problem in the satellites especially after 2011. Also, there is a gap between May 2017 and July 2018 because there is no satellite for that period. Let’s see what the data looks like.
GRF_TWS_2D <- rowMeans(GRF_TWS, na.rm = T, dims = 2)
image.plot(TWS_lon, TWS_lat, GRF_TWS_2D, xlab = 'longitude',
ylab = 'latitude', main = 'Temporal Mean of TWS (cm)')Warning in temp$x: partial match of 'x' to 'xlab'
Warning in temp$y: partial match of 'y' to 'ylab'
The shape of the figure is not clear, so some mask should be applied. There is a land_mask in the data. Let’s use this land_mask:
nc_land_mask1 <- ncvar_get(nc_file1, 'land_mask')
nc_land_mask2 <- nc_land_mask1
nc_land_mask2[nc_land_mask2 == 0] <- NA
land_mask <- array(nc_land_mask2, dim = dim(GRF_TWS))
GRF_TWS_masked <- GRF_TWS*land_mask
GRF_TWS_masked_2D <- rowMeans(GRF_TWS_masked, na.rm = T, dims = 2)
image.plot(TWS_lon, TWS_lat, GRF_TWS_masked_2D,
xlab = 'longitude', ylab = 'latitude',
main = 'Temporal Mean of TWS (cm)')Warning in temp$x: partial match of 'x' to 'xlab'
Warning in temp$y: partial match of 'y' to 'ylab'
The representation of the globe is not suitable for the hydro-meteorological variables. Their longitude is going from -180 to 180.
dimx <- dim(GRF_TWS_masked)[1]
GRF_TWS_masked2 <- GRF_TWS_masked
for (xx in 1:dimx) {
if (xx <= 360) {GRF_TWS_masked2[xx,,,] <- GRF_TWS_masked[xx+360,,,]}
if (xx > 360) {GRF_TWS_masked2[xx,,,] <- GRF_TWS_masked[xx-360,,,]}
}
TWS_lat2 = TWS_lat
TWS_lon2 = seq(-179.75, 179.75, 0.5)Check the image again.
GRF_TWS_masked_2_2D <- rowMeans(GRF_TWS_masked2, na.rm = T, dims = 2)
image.plot(TWS_lon2, TWS_lat2, GRF_TWS_masked_2_2D,
xlab = 'longitude', ylab = 'latitude',
main = 'Temporal Mean of TWS (cm)')Warning in temp$x: partial match of 'x' to 'xlab'
Warning in temp$y: partial match of 'y' to 'ylab'
That’s perfect. The TWS data is ready for the analysis part. However, let’s look at the variability of temporal mean of TWS data.
GRF_TWS_masked_2_2D <- rowMeans(GRF_TWS_masked2, na.rm = T, dims = 2)
image.plot(TWS_lon2, TWS_lat2, GRF_TWS_masked_2_2D, xlab = 'longitude',
ylab = 'latitude', main = 'Temporal Mean of TWS (cm)',
zlim = c(-30,30))Warning in temp$x: partial match of 'x' to 'xlab'
Warning in temp$y: partial match of 'y' to 'ylab'
Warning in temp$z: partial match of 'z' to 'zlim'
3.2 Precipitation
3.2.1 Downloading Precipitation Data
The GPCP precipitation observations could be downloaded from here. However, there is a problem to download the data. Each year has separate folders, and these folders include separate files. Moreover, there is no specific pattern of the files. For this reason, read_html function was used to extract the file names in each folder.
years <- 2002:2023
len_year <- length(years)
url01 = 'https://www.ncei.noaa.gov/data/global-precipitation-climatology-project-gpcp-monthly/access/'
for (yyyy in 1:len_year) {
url02 <- paste0(url01, years[yyyy],'/')
file_list02 <- read_html(url02) |>
html_nodes("a") |>
html_text(trim = T)
file_list_gpcp <- file_list02[grepl('gpcp_',file_list02)]
len_file2 <- length(file_list_gpcp)
for (ff02 in 1:len_file2) {
inname02 <- paste0(url02, file_list_gpcp[ff02])
outname02 <- paste0(main_folder,'/02-GPCP-Precipitation/',
file_list_gpcp[ff02])
if (file.exists(outname02)) {next}
download.file(inname02, outname02, mode="wb")
}
}3.2.2 Handling of Precipitation Data
The view of this format in R can be seen in the figure.
The detail of the precipitation data is in the precip variable so it should be extracted. The view of this variable can be seen in the figure.
precip VariableThe dimensions of the data for one month can be seen in this figure. The dimensions are 144, 72. Precipitation observations were downloaded for each month separately between 2002 and 2023. For easier use in later stages of the study, precipitation data was assigned to a single variable called GPCP_PRCP. In this step, longitude and latitude of the data were extracted.
infold1100 <- paste0(main_folder, '/02-GPCP-Precipitation/')
file_list1 <- list.files(infold1100)
len_file1 <- length(file_list1)
file_list2 <- file_list1[1:258]
len_file2 <- length(file_list2)
dim_year <- ceiling(len_file2/12)
GPCP_PRCP <- array(NA, dim = c(144,72,12,dim_year))
inname0000 <- paste0(infold1100, file_list2[1])
nc_file2 <- nc_open(inname0000)
nc_close(nc_file2)
for (ff01 in 1:len_file2) {
inname0000 <- paste0(infold1100, file_list2[ff01])
year <- as.numeric(substr(file_list2[ff01],22,25))
month <- as.numeric(substr(file_list2[ff01],26,27))
year_loc <- year-2002+1
nc_file3 <- nc_open(inname0000)
GPCP_PRCP0 <- ncvar_get(nc_file3, 'precip')
PRCP_lon <- ncvar_get(nc_file3, 'longitude')
PRCP_lat <- ncvar_get(nc_file3, 'latitude')
nc_close(nc_file3)
GPCP_PRCP[,,month,year_loc] <- GPCP_PRCP0
}
head(PRCP_lon)[1] 1.25 3.75 6.25 8.75 11.25 13.75
tail(PRCP_lon)[1] 346.25 348.75 351.25 353.75 356.25 358.75
head(PRCP_lat)[1] -88.75 -86.25 -83.75 -81.25 -78.75 -76.25
tail(PRCP_lat)[1] 76.25 78.75 81.25 83.75 86.25 88.75
The longitude of the data starts from 1.25° and going up to 358.75°, while the latitude of the data starts from -88.75° and going up to 88.75°.
Since TWS observations have 1° spatial resolution, the precipitation data was down scaled from 2.5° to 1° using bilinear interpolation using raster package. The unit of the precip variable in the netcdf file is mm/day. However, the unit should be mm/month. Thus, each data is multiple by the days in the corresponding month.
dimx <- dim(GPCP_PRCP)[1]
dimy <- dim(GPCP_PRCP)[2]
dim4 <- dim(GPCP_PRCP)[4]
r <- raster(nrow = 720, ncol = 360)Warning in .local(...): partial argument match of 'nrow' to 'nrows'
Warning in .local(...): partial argument match of 'ncol' to 'ncols'
extent(r) <- extent(c(0,1,0,1))
GPCP_PRCP_Downscaled <- array(NA, dim = c(720, 360, 12, dim4))
daymon <- c(31,28,31,30,31,30,31,31,30,31,30,31)
for (yyyy in 1:dim4) {
for (mm in 1:12) {
PRCP1 <- GPCP_PRCP[,,mm,yyyy]
PRCP2 <- raster(PRCP1)
PRCP3 <- resample(PRCP2, r, metod = 'bilinear')
PRCP4 <- as.matrix(PRCP3)*daymon[mm]
GPCP_PRCP_Downscaled[,,mm,yyyy] <- PRCP4
}
}Check the time series of the spatial mean of precipitation.
GPCP_PRCP_Downscaled_3D <- GPCP_PRCP_Downscaled
dim(GPCP_PRCP_Downscaled_3D) <-
c(dim(GPCP_PRCP_Downscaled)[1], dim(GPCP_PRCP_Downscaled)[2],
dim(GPCP_PRCP_Downscaled)[4]*12)
GPCP_PRCP_TS <- colMeans(GPCP_PRCP_Downscaled_3D, na.rm = T, dims = 2)
start_date <- ymd("2002-01-01")
end_date <- ymd("2023-12-31")
GPCP_PRCP_date <- seq(start_date, end_date, by = "months")
GPCP_PRCP_df <- data.frame(GPCP_PRCP_date, GPCP_PRCP_TS)
ggplot(GPCP_PRCP_df, aes(x = GPCP_PRCP_date, y = GPCP_PRCP_TS)) +
geom_line() +
scale_x_date(limit = c(as.Date("2002-01-01"), as.Date("2025-06-01"))) +
labs(x = "", y = "Spatial Mean of Precipitation (mm)")Let’s see what the data looks like.
PRCP_lon <- seq(0.25, 359.75, 0.5)
PRCP_lat <- seq(-89.75, 89.75, 0.5)
GPCP_PRCP_Downscaled_2D <- rowMeans(GPCP_PRCP_Downscaled, na.rm = T, dims = 2)
image.plot(PRCP_lon, PRCP_lat, GPCP_PRCP_Downscaled_2D,
xlab = 'longitude', ylab = 'latitude',
main = 'Temporal Mean of Precipitation (mm)')Warning in temp$x: partial match of 'x' to 'xlab'
Warning in temp$y: partial match of 'y' to 'ylab'
The shape of the figure is not clear, so some mask should be applied. Same land_mask with TWS data can be applied.
GPCP_PRCP_Downscaled_masked <- GPCP_PRCP_Downscaled*land_mask
GPCP_PRCP_Downscaled_masked_2D <- rowMeans(GPCP_PRCP_Downscaled_masked, na.rm = T, dims = 2)
image.plot(PRCP_lon, PRCP_lat, GPCP_PRCP_Downscaled_masked_2D,
xlab = 'longitude', ylab = 'latitude',
main = 'Temporal Mean of Precipitation (mm)')Warning in temp$x: partial match of 'x' to 'xlab'
Warning in temp$y: partial match of 'y' to 'ylab'
The representation of the globe is not suitable for the hydro-meteorological variables. Their longitude is going from -180 to 180.
dimx <- dim(GPCP_PRCP_Downscaled_masked)[1]
GPCP_PRCP_Downscaled_masked2 = GPCP_PRCP_Downscaled_masked
for (xx in 1:dimx) {
if (xx <= 360) {GPCP_PRCP_Downscaled_masked2[xx,,,] <- GPCP_PRCP_Downscaled_masked[xx+360,,,]}
if (xx > 360) {GPCP_PRCP_Downscaled_masked2[xx,,,] <- GPCP_PRCP_Downscaled_masked[xx-360,,,]}
}
PRCP_lat2 = PRCP_lat
PRCP_lon2 = seq(-179.75, 179.75, 0.5)Check the image again.
GPCP_PRCP_Downscaled_masked2_2D <- rowMeans(GPCP_PRCP_Downscaled_masked2, na.rm = T, dims = 2)
image.plot(PRCP_lon2, PRCP_lat2, GPCP_PRCP_Downscaled_masked2_2D,
xlab = 'longitude', ylab = 'latitude',
main = 'Temporal Mean of Precipitation (mm)')Warning in temp$x: partial match of 'x' to 'xlab'
Warning in temp$y: partial match of 'y' to 'ylab'
That’s perfect. The precipitation data is ready for the analysis part.